Learning in Natural Language: Theory and Algorithmic Approaches
Abstract
This article summarizes work on developing a learning theory account for the major learning and statistics based approaches used in natural language processing. It shows that these approaches can all be explained using a single distribution free inductive principle related to the PAC model of learning. Furthermore, they all make predictions using the same simple knowledge representation: a linear representation over a common feature space. This is significant both for explaining the generalization and robustness properties of these methods and for understanding how these methods might be extended to learn from more structured, knowledge intensive examples, as part of a learning centered approach to higher level natural language inferences.

1 Introduction

Many important natural language inferences can be viewed as problems of resolving phonetic, syntactic, semantic or pragmatic ambiguities, based on properties of the surrounding context. It is generally accepted that a learning component must have a central role in resolving these context sensitive ambiguities, and a significant amount of work has been devoted in the last few years to developing learning methods for these tasks, with considerable success. Yet, our understanding of when and why learning works in this domain and how it can be used to support increasingly higher level tasks is still lacking.

This article summarizes work on developing a learning theory account for the major learning approaches used in NL.* While the major statistics based methods used in NLP are typically developed with a Bayesian view in mind, the Bayesian principle cannot directly explain the success and robustness of these methods, since their probabilistic assumptions typically do not hold in the data. Instead, we provide this explanation using a single, distribution free inductive principle related to the PAC model of learning. We describe the unified learning framework and show that, in addition to explaining the success and robustness of the statistics based methods, it also applies to other machine learning methods, such as rule based and memory based methods.

An important component of the view developed here is the observation that most methods use the same simple knowledge representation. This is a linear representation over a new feature space: a transformation of the original instance space to a higher dimensional and more expressive space. Methods vary mostly algorithmically, in the ways they derive weights for features in this space. This is significant both for explaining the generalization properties of these methods and for developing an understanding of how and when these methods can be extended to learn from more structured, knowledge intensive examples, perhaps hierarchically. These issues are briefly discussed, and we emphasize the importance of studying knowledge representation and inference in developing a learning centered approach to NL inferences.

* This research is supported by NSF grants IIS-9801638, SBR-9873450 and IIS-9984168.

2 Learning Frameworks

Generative probability models provide a principled approach to the study of statistical classification in complex domains such as NL. It is common to assume a generative model for such data, estimate its parameters from training data and then use Bayes rule to obtain a classifier for this model.
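To make this recipe concrete, the following is a minimal sketch, not taken from the article, of deriving a classifier from a simple generative model via Bayes rule: a naive Bayes model whose parameters are estimated from labeled training data and whose prediction is the label maximizing the (log) posterior. The function names, smoothing constant, and toy data are illustrative assumptions.

```python
from collections import defaultdict
import math

def train_naive_bayes(examples):
    """Estimate generative-model counts for Pr(label) and Pr(feature | label).
    Each example is (iterable_of_features, label)."""
    label_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feature_counts, vocab

def classify(features, label_counts, feature_counts, vocab, alpha=1.0):
    """Bayes rule: choose the label maximizing
    log Pr(label) + sum_f log Pr(f | label), with add-alpha smoothing."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + alpha * len(vocab)
        for f in features:
            score += math.log((feature_counts[label].get(f, 0) + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical usage: disambiguating between two confusable words from context.
train = [({"to", "the", "store"}, "their"), ({"they", "are", "late"}, "they're")]
model = train_naive_bayes(train)
print(classify({"are", "very", "late"}, *model))
```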
In the context of NL most classifiers are derived from probabilistic language models which estimate the probability of a sentence s using Bayes rule, and then decompose this probability into a product of conditional probabilities according to the generative model:

$$\Pr(s) = \Pr(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} \Pr(w_i \mid w_1, \ldots, w_{i-1}) = \prod_{i=1}^{n} \Pr(w_i \mid h_i)$$

where $h_i$ is the relevant history when predicting $w_i$, and $s$ is any sequence of tokens, words, part-of-speech (pos) tags or other terms. This general scheme has been used to derive classifiers for a variety of natural language applications including speech applications (Rab89), pos tagging (Kup92; Sch95), word-sense disambiguation (GCY93) and context-sensitive spelling correction (Gol95). While the use of Bayes rule is harmless, most of the work in statistical language modeling and ambiguity resolution is devoted to estimating terms of the form $\Pr(w \mid h)$. The generative models used to estimate these terms typically make Markov or other independence assumptions. It is evident from studying language data that these assumptions are often patently false and that there are significant global dependencies both within and across sentences. For example, when using a (Hidden) Markov Model (HMM) as a generative model for pos tagging, estimating the probability of a sequence of tags involves assuming that the pos tag $t_i$ of the word $w_i$ is independent of other words in the sentence, given the preceding tag $t_{i-1}$. It is not surprising therefore that this results in a poor estimate of the probability density function. However, classifiers built based on these false assumptions nevertheless seem to behave quite robustly in many cases.

A different, distribution free inductive principle that is related to the PAC model of learning is the basis for the account developed here. In an instance of the agnostic variant of PAC learning (Val84; Hau92; KSS94), a learner is given data elements $(x, l)$ that are sampled according to some fixed but arbitrary distribution $D$ on $X \times \{0, 1\}$. $X$ is the instance space and $l \in \{0, 1\}$ is the label.¹ $D$ may simply reflect the distribution of the data as it occurs "in nature" (including contradictions) without assuming that the labels are generated according to some "rule". Given a sample, the goal of the learning algorithm is to eventually output a hypothesis $h$ from some hypothesis class $\mathcal{H}$ that closely approximates the data. The true error of the hypothesis $h$ is defined to be $error_D(h) = \Pr_{(x,l) \sim D}[h(x) \neq l]$, and the goal of the (agnostic) PAC learner is to compute, for any distribution $D$, with high probability ($> 1 - \delta$), a hypothesis $h \in \mathcal{H}$ with true error no larger than $\epsilon + \inf_{h \in \mathcal{H}} error_D(h)$.

In practice, one cannot compute the true error $error_D(h)$. Instead, the input to the learning algorithm is a sample $S = \{(x_i, l_i)\}_{i=1}^{m}$ of $m$ labeled examples and the learner tries to find a hypothesis $h$ with a small empirical error $error_S(h) = |\{x \in S \mid h(x) \neq l\}| / |S|$, and hopes that it behaves well on future examples. The hope that a classifier learned from a training set will perform well on previously unseen examples is based on the basic inductive principle underlying learning theory (Val84; Vap95) which, stated informally, guarantees that if the training and the test data are sampled from the same distribution, good performance on a large enough training sample guarantees good performance on the test data (i.e., good "true" error).

¹ The model can be extended to deal with any discrete or continuous range of the labels.
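As a concrete illustration of the decomposition above under a Markov assumption, here is a minimal sketch, not from the article, of scoring a sentence with a first-order (bigram) model: the full history $w_1, \ldots, w_{i-1}$ is truncated to the single preceding token, and the conditional probabilities are maximum-likelihood estimates with add-alpha smoothing over a toy corpus. Function names and the smoothing constant are assumptions.

```python
from collections import defaultdict
import math

def train_bigram(corpus):
    """Count unigrams and bigrams; corpus is a list of token lists."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return unigram, bigram

def log_prob(sentence, unigram, bigram, vocab_size, alpha=0.1):
    """log Pr(s) = sum_i log Pr(w_i | h_i), with the history h_i truncated to
    the preceding token w_{i-1} (the Markov assumption), add-alpha smoothed."""
    tokens = ["<s>"] + sentence
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log((bigram[(prev, cur)] + alpha) /
                       (unigram[prev] + alpha * vocab_size))
    return lp

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
unigram, bigram = train_bigram(corpus)
vocab = {w for s in corpus for w in s} | {"<s>"}
print(log_prob(["the", "dog", "sleeps"], unigram, bigram, len(vocab)))
```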
Moreover, the quality of the generalization is inversely proportional to the expressivity of the class $\mathcal{H}$. Equivalently, for a fixed sample size $|S|$, the quantified version of this principle (e.g., (Hau92)) indicates how much one can count on a hypothesis selected according to its performance on $S$. Finally, notice the underlying assumption that the training and test data are sampled from the same distribution; this framework addresses this issue (see (GR99)).

In our discussion, functions learned over the instance space $X$ are not defined directly over the raw instances but rather over a transformation of it to a feature space. A feature is an indicator function $\chi : X \to \{0, 1\}$ which defines a subset of the instance space: all those elements in $X$ which are mapped to 1 by $\chi$. $\mathcal{X}$ denotes a class of such functions and can be viewed as a transformation of the instance space; each example $(x_1, \ldots, x_n) \in X$ is mapped to an example $(\chi_1, \ldots, \chi_{|\mathcal{X}|})$ in the new space. We sometimes view a feature as an indicator function over the labeled instance space $X \times \{0, 1\}$ and say that $\chi(x, l) = 1$ for examples $x \in \chi(X)$ with label $l$.

3 Explaining Probabilistic Methods

Using the abovementioned inductive principle we describe a learning theory account that explains the success and robustness of statistics based classifiers (Rot99a). A variety of methods used for learning in NL are shown to make their prediction using Linear Statistical Queries (LSQ) hypotheses. This is a family of linear predictors over a set of features which are directly related to the independence assumptions of the probabilistic model assumed. The success of these classification methods is then shown to be due to the combination of two factors:

• Low expressive power of the derived classifier.
• Robustness properties shared by all linear statistical queries hypotheses.

Since the hypotheses are computed over a feature space chosen so that they perform well on training data, learning theory implies that they perform well on previously unseen data, irrespective of whether the underlying probabilistic assumptions hold.
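To illustrate this shared representation, the following is a minimal sketch, assumed rather than taken from the article, of a linear predictor over a binary feature space: each raw instance is mapped through a collection of indicator features $\chi$, and the prediction is a thresholded weighted sum of the active features, which is the form an LSQ-style hypothesis takes once its coefficients are fixed. The features, weights, and example context below are hypothetical.

```python
# Each feature is an indicator chi: instance -> {0, 1}. A hypothesis is a
# weight per feature plus a threshold; prediction is linear in that space.
def to_feature_vector(instance, features):
    """Map a raw instance into the transformed {0,1}^|features| space."""
    return [chi(instance) for chi in features]

def linear_predict(instance, features, weights, threshold=0.0):
    """Predict 1 iff the weighted sum of active features exceeds the threshold."""
    score = sum(w * chi(instance) for chi, w in zip(features, weights))
    return 1 if score > threshold else 0

# Hypothetical indicator features over a word's context (a list of tokens).
features = [
    lambda ctx: 1 if "bank" in ctx else 0,                             # word presence
    lambda ctx: 1 if ("river", "bank") in zip(ctx, ctx[1:]) else 0,    # bigram
    lambda ctx: 1 if "money" in ctx else 0,
]
weights = [0.5, 1.2, -0.8]  # in practice these weights are learned, not hand-set
print(linear_predict(["the", "river", "bank"], features, weights))
```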